Introduction to R
Module 1
The R Language
Ecosystem
- Easy and intuitive to learn
- Robust quantitative finance packages
- Robust statistical & econometric packages
- State-of-art visualizations
- Fast and efficient modern packages
- Versatile
- Python (reticulate), C++ (rcpp)
- ML (keras, h2o), Apache Spark (sparklyr), Arrow (arrow)
Great visualizations
Fast statistical packages
Excellent syntax integration
Write once, run everywhere: dplyr syntax that can be used in:
arrowfor fast out-of-core datadbplyrfor databases (duckdb, SQLite, etc.)sparklyrfor spark integrationduckplyrfor duckdb backendand others (
dtplyr,tidypolars)
Introduction to File Systems
What is a Path?
A path is an address of file or folder
Types of paths:
- Absolute path: exact location from the root
- Relative path: relative location from current working directory
Absolute Paths
Begins from the root directory (
/in Mac/Linux,C:\in Windows)- Mac/Linux:
/Users/username/Documents/project/data.csv - Windows:
C:\Users\username\Documents\project\data.csv
- Mac/Linux:
Unambiguous
Not dependent on the current working directory
OS dependent (Windows vs macOS & Linux)
Relative Paths
The location relative to the
current working directory.- Example:
data/project/data.csv
- Example:
Portable across systems
Easier to manage in projects
Working Directory
The working directory is the folder where R is located
getwd()shows the current working directorysetwd("/path/to/directory")to change
Home Directory
- The “base” directory for user in operating system
- Used to store personal files
Windows: username/Documents folder
- The default home directory for R is specified at
C:\Users\<username>\Documents. johnhome directory would beC:\Users\john\Documents.
Mac / Linux : username folder
/Users/<username>- e.g.
/Users/john
Tilde ~
Represents the user’s home directory.
In R, ~ refers to
C:\Users\<username>\Documentsfor windows/Users/<username>for Mac/Linux
Example: ~/cases refers to :
C:\Users\<username>\Documents\casesfor windows/Users/john/casesfor Mac/Linux
R Basics
Basic Calculations
You can use R to do basic math calculations
Concatenate multiple elements
To concatenate multiple elements to a vector:
use c() function:
An (atomic) vector in R is an ordered pair of multiple elements.
Vectorized calculations
You can simply calculate vectorized calculations with R:
[1] 3 6 9
[1] 15 18 21 24
Vectorized operations are much faster than looping over each elements.
In R, most operations are automatically vectorized.
Assigning values
Use <- to assign a value to a symbol (name).
Here, my_number is called symbol, or name of an object.
Style guide: though you still can use =, use <- for assignment.
Use = for specifying function arguments instead.
Some IDEs (i.e. RStudio / Positron / VScode) have Alt (Option) + - as a shortcut.
R has strict rules about a syntactic name (symbol).
- It is case sensitive
- It cannot contain whitespace
- It cannot start with numbers
Error in parse(text = input): <text>:2:2: unexpected input
1: my_number_1 <- 15
2: 1_
^
You can’t use reserved words like
TRUE,NULL,if, etc.If you’d deliberately use non-syntactic names, use backtick `
Object naming conventions
Since objects cannot contain whitespace as symbol, there are two popular naming conventions.
snake_casecamelCase
It is better to make a short, self-explanatory name.
e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15
R Prompt
On console:
>means: “Waiting your command”+means: “Continue command”Hit
CTRL + Cto abort in console
Object types in R
Vector type: common data type
Special type (non-vector): non-vectors
- functions, environments, etc.
Vectors are the most important family of data types in R.
Vector type
Vector is a data structure that stores multiple elements. It comes in two flavors:
- Atomic vector: all elements same type
- Generic vector: known as list, can have different types of elements
NULL is not a vector, but often serves as zero length vector.
Atomic vectors
There are four primary types of atomic vector in R, and two others.
Type of Atomic vectors
- Logical (or Boolean):
TRUE,FALSE,NA
- Integer: integer numbers
Attach L to treat the number as strict integer number.
- Double: real numbers
“numeric” is a collective term for both double and integers but often used as if it were a synonym for “double” or “real number” in practice.
- Character (or string): words, wrapped by
"or'
[1] "character"
[1] "character"
Style guide: Use double quote " for character instead of ' if possible.
- Two other types:
raw type: binary data type
complex type: complex numbers (e.g. 3 + 4i)
rarely needed in Finance
Missing values: NA
- Missing values are denoted by
NA- Not Applicable: similar to “undefined” above
- They are not identical to zero or NULL
NULLis intentional empty “placeholder” in R
Exercise
What are four primary types of atomic vector?
What are the types of
a,b,c,dbelow?
Confirm your answer with typeof().
List
List is a generic vector that is not atomic.
Atomic vector can have only one type for its elements (Double, Integer, Logical, …)
List can hold multiple data types for its member (even list itself)
Class attribute
The class attributes in R is used to define the behavior of objects with functions.
Especially important classes in Finance are:
- Date, Time
- Factors
- Dataframe
- or your own custom-built class
Example: Date/Time
Very important class in Finance.
They are built from double type atomic vector (type), but has own specific rule for uses (class).
[1] "2026-02-09"
[1] "2026-02-09 17:56:06 EST"
Check their data type:
Check their attributes: they have class attributes.
To directly access the class attribute:
Class attribute and change of behavior
For an example, see how it works with + function.
[1] "2026-02-10"
[1] "2026-02-09 17:56:07 EST"
Q: Why +1 yield different results?
A: Because they are in different classes. + 1 is inferred differently.
Class attribute gives context how it should behave with functions.
In the deep down, they are just numbers:
- Date: The value of double represents the number of days since “1970-01-01” (Unix Epoch)
- Time: the number of seconds since Unix Epoch
Exercise
Class exercise
- Execute
typeof(c(1,2,3))andtypeof(c(1L, 2L, 3L)). What’s the difference? - Assign a vector with three elements: 1,3,5, and name it as
my_first_object - Assign another object with one element: 5, name it as
MySecondObject - Multiply
my_first_objectwithMySecondObject. What do you get? - Assign a vector with your name:
my_name - What do you get when you execute
my_name + 3? Why do you get this result?
Importance of class
Since class determines the behavior of the object, it is crucial to know your data class especially performing function calls.
Calling a function means executing/applying a function.
As you cannot use add function on character and numeric.
Functions
Defining a function
You can define custom function (User-defined function) in R with the following syntax:
The function can be called in prefix form:
Functions example 1
Our first function, c() concatenates all the values and generate a single object!
Functions example 2
seq function: generates a sequence of numbers.
- It has three arguments:
fromandtoandby
- If users don’t specify the argument name, it reads input in order
- If user enters more args than the function space, it raises error.
Getting Help on Functions
All R functions are built by someone, and documentation is typically provided.
For detailed description of any function, use ? followed by the function’s name.
For example, try below code in your console:
Or, use help()
Build a Perpetuity Calculator
The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:
\[ PV_{PER} = \frac{PMT}{r - g} \]
where
- PMT is the payment or cash flow.
- r is the discount rate.
- g is the growth rate of the cash flow.
This formula applies when r > g.
Defining a function
You can design your perpetuity function in R with following syntax:
Calling a function
Let’s call the function above:
- What is the PV of perpetuity, when PMT = $10,000, r = 7% and g = 3%?
- Assign the result value of the function
- Vector can be the input (vectorized)
Exercise
Define a perpetuity calculator function,
pv_per(). What is the pv when PMT = $50,000, r = 4%, g = 0%?What is the pv when PMT = $50,000, r = 4%, but g are 1%, 2%, 3%?
Default Arguments
What happens if user doesn’t specify one argument?
Error in pv_per(10000, 0.07): argument "g" is missing, with no default
You can set default values for arguments, allowing them to be omitted when calling the function.
Anonymous function
Functions are typically named so they can be reused multiple times.
However, you can skip naming a custom function, and they are called anonymous function.
- Useful when the function is simple and called only one time.
They are not stored as objects since they do not have assigned symbols (names).
Syntactic sugar: Anonymous Function
Syntactic sugar refers to a feature in programming that makes the code simple to read or write, without adding functionality.
(Anonymous) functions can be defined with syntactic sugar:
Exercise
Convert below perpetuity function (pv_per) to anonymous function:
Syntactic sugar: Pipe Operator
A Motivating example
Solve below math problem. Describe your steps. What was the first and the last step?
\[ \sqrt{(2+4)^2 - 3 * 4} = ? \]
\[ \sqrt{(2+4)^2 - 3 * 4} \]
- Do 2+4
and thensquare it, and save it in your memory - Do 3*4
and thensubtract it from previous, and update your memory and thensquare root the value
Similarly, codes can be written not in the order we calculate.
It is easier for us to read & write code in the order it is operated.
When we have composite function calls such as
The call sequence is x -> k() -> h() -> g() -> f().
It is rather easier to read, write and debug if we can write a code like:
Pipe operator & Function Chain
This is where pipe operator |> becomes handy in R.
The pipe operator does “and then” job, and it can be written as:
Style guide: use |> instead of %>%. Use shortcut Cmd (Ctrl) + Shift + M.
Sometimes you’ll see %>% operator instead, which comes from external library in R (magrittr), meanwhile |> is R native. In order to use %>%, external package library(magrittr) should be imported.
Exercise (challenge!)
Solve \(\sqrt{2^3}\) using pipe operator.
- First, solve above procedual way
- For square root, use
sqrt()function
- Next, solve using the pipe operator.
- Define function named
cubethat doesx^3 - Code should start with
2.
External packages
Packages are add-on libraries that extend the functionality of R.
- They provide additional functions, datasets, and tools for various tasks
- Can be easily installed with
install.packages() - And loaded in R session with
library()
Installing packages:
Load packages: you need to load packages to use its functionality.
- Need to load only once per session
Control Structure
Control Structure?
Control structure dictates which code gets executed and when.
- Conditional Statements:
ifstatements: Execute code if a condition is true.else/else ifstatements: Execute alternative code if the condition is false.
- Loops:
forloops: Repeat code block a specified number of times.whileloops: Continue executing code as long as a condition is true.
- Map (apply):
- Map a function to each element of a collection without explicitly writing loops.
If-else
The basic form of if and if-else statement in R:
Example 1: if and else executes code based on logical conditions.
[1] "The stock price has increased significantly!"
Example 2: If condition is not met, then nothing happens (skipped).
Example 3: else if checks one more logic condition:
[1] "The stock price has increased moderately."
Example 4: There can be multiple else if
[1] "The stock price has increased slightly."
Example 5: else is executed when all of if conditions are not met.
[1] "The stock price has decreased."
Logical Test Operators
- Less than (
<or>) - Less than or equal (
<=or>=) - Equality (
==) and Inequality (!=) - Logical NOT (
!) - AND (
&), OR (|)
Exercise
Write an if-else statement:
- If PMT > 1000, add 10000 to PMT (i.e.,
PMT <- PMT + 10000) - Else if PMT > 500, add 100 to PMT
- Else, set PMT = 0
What is the outcome of above if-else, if initial PMT was 750?
For loops
For loops are used when code has to be iterated a specified number of times.
If for loop was explictly written:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
To use number index for each element: use seq_along() in the loop.
x itself can be a iterable:
next and break
Generally used with if-else condition tests inside loop.
next is used to skip an iteration of loop.
break is used to exit loop immediately.
For loop: Compound interest
How to calculate compound interest over multiple years using a for loop?
- Principal: $10,000
- Interest rate: 5%
- Number of years: 10
[1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
[9] 15513.28 16288.95
Exercise
Based on the previous example, do the following:
Q1. Skip the first year using if and next
- Cashflow should have zero printed on the first slot
Q2. Stop the calculation if value exceeds $14,000
- Cashflow should have zero printed on slots that exceed value of $14,000
Exercise 2
Write a function that checks class of an input.
If the input is numeric, print “Numeric input!”, otherwise, print “Not numeric!”
- use
inherits(x, "numeric")for logical test.
Function mapping
map or apply is an implicit function loop.
- a function
fis an input arg formap() map()requirestidyverseorpurrrpackage- Succinct and easy to read than
forloops - Easy to use on data frames
Example: map()
- Output is always list
map function 2
If the desired output is not list but atomic vector:
map_dbl()a numeric (double) vectormap_chr()a character vectormap_lgl()a logical vectormap_int()an integer vector
Exercise
Generate
times_two()function that multiplies input by 2.map
times_twofunction over1:10Achieve same result with anonymous function instead.
Achieve same result with for loop.
Lab Exercise 1
Convert below for loop operation to map()
Financial Data Manipulation
data.frame class
One of the most important data class in R, built on top of list type
Stores data structure in 2D tabular form:
with rows (observations, or records)
and columns (variables)
columns can be different types!
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
Creating a data.frame
- Creating a data.frame is similar to list
Exercise
Create a dataframe named as housing:
- 6 columns: Name, Age, Sex, Income, Housing, Zipcode
- Name: Amy, Bill, Charles, Donna, Eckert
- Age: 21, 25, 30, 38, 49
- Sex: Female, Male, Male, Female, Male
- Income: 36000, 53000, 89000, 82000, 166000
- Housing: “Rent”, “Rent”, “Own”, “Own”, “Rent”
- Zipcode: 12333, 12543, 11255, 12333, 33533
What are the type (class) of each column automatically recognized by R?
- Check with
str(housing).
Q: What should be their type (class) in theory?
Modern R Syntax
dplyr / tidyverse
- A modern, new approach that revolutionized R
- Very fast, written in C++
- Verbal and easy to read
- Developed by RStudio team (Posit)
dplyr verbs
Key verbs
- Verbs for the core functionality
- Those verbs are used with pipe operator
|>or|>
select(): select subset of columnsrename(): rename columnsrelocate(): change column positions
filter(): select subset of rows with conditionarrange(): reorder rowsmutate(): add new columns (variables)summarize(): generate summary table based ongroup_by()
dplyr::select()
Example 1:
Example 2:
Select from “Sepal.Length” (1st) to “Petal.Length” (3rd) column
Example 3:
Select columns except for specified columns
Convenience functions
Convenience functions are allowed within select().
Examples:
starts_with(),ends_with(),contains()…and lots more!
matches(),num_range(),all_of(),any_of(),everything(),last_col(),where()Check tidyselect documentation
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
dplyr::rename()
dplyr::relocate()
Rearrange the column order.
Non-mentioned columns retain original ordering.
dplyr::filter()
Example 1: Subset rows with conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Example 2: Multiple conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
2 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
3 0.26 Good E VVS1 57.9 60 554 4.22 4.25 2.45
- c.f.) If you want to filter rows based on row numbers, use
slice()
Class Exercise
From diamonds dataframe (load tidyverse for access),
Subset the dataframe with below conditions:
carat is equal to 0.26 and;
clarity is “VS2” and;
select columns “carat”, “cut”, “clarity”
then store it as
sub_diamonds
What is the dimension of
sub_diamonds? Check withdim(sub_diamonds)
dplyr::arrange()
Arrange, or sort the dataframe based on the specified column value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
- Can be arranged in descending order
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.9 3.8 6.4 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.7 2.6 6.9 2.3 virginica
Arranging with multiple variables: Hierachical ordering
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
4 4.4 3.2 1.3 0.2 setosa
5 4.5 2.3 1.3 0.3 setosa
It is incorrect to chain arrange function:
iris |> arrange(Sepal.Length) |> arrange (Sepal.Width)
because it resets ordering.
dplyr::mutate()
Compute transformation of variables and create new column.
Example: Z-score standardizing
\[ \frac{X - \bar{X}}{\sigma_X} = \frac{X - mean(X)}{sd(X)} \]
dplyr::summarize()
Many times we are interested in getting summary statistics for groups.
Summarizing is also called as data aggregation
Often used with
group_by(), to generate summary
Example: What is the average of “Sepal.Length” by “Species” in iris dataset?
# A tibble: 9 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3 5.9 2.1 virginica
Above code shows the first 3 rows for each Species.
Example: What is the average of Sepal.Length by each group?
# A tibble: 3 × 2
Species avg_sepal_length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
What if we want to summarize across every column?
- Example: Calculate average of across all columns by each group
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
Lab Exercise
From diamonds dataframe:
- How many observations (rows) have carat value greater than 3.1?
- What is the average of price of which carat equals to 1.0 ?
- Use
filter()andsummarize()
- Use
- What is the minimum, average, maximum of the price by cut?
- Use
min()andmax()
- Use
- How many observations are found by each cut?
- Use
n()
- Use
- Arrange the dataframe by carat (descending) and then price (ascending).
- Mutate a new column, named
xyz, which is mulplication of x and y and z. Store the dataframe asmy_diamond. - What is the maximum value of xyz (
max(my_diamond$xyz))? - What is the correlation between carat and price by each cut?
- Use
cor(x,y)for correlation between x and y
- Use
map function on dataframe
Advanced example
Running repeated regressions and get coefficients, by each group.
- Useful for CAPM beta estimations
# A tibble: 3 × 3
# Groups: Species [3]
Species intercept slope
<fct> <dbl> <dbl>
1 setosa 2.64 0.690
2 versicolor 3.54 0.865
3 virginica 3.91 0.902
Lab Exercise
Summarize Financial data
- Install
tidyquantpackage
- Download end-of-day stock price data using code below
Warning: package 'zoo' was built under R version 4.4.3
- How is the average daily trading volume calculated for each company(symbol)? You should get result like below:
| symbol | mean(volume) |
|---|---|
| AAPL | 98953731 |
| BAC | 52857474 |
| MSFT | 30654553 |
| TSLA | 133226380 |
- What is the average trading volume of each company during the peak COVID-19 Season (March 2020)? You should get:
- Hint:
filter()and useyear()andmonth()to date variable
| symbol | mean(volume) |
|---|---|
| AAPL | 285457836 |
| BAC | 128477227 |
| MSFT | 73304341 |
| TSLA | 287001136 |
Data Tidying
Tidy data
The real world data does not come clean, ready for your analysis.
You will learn a consistent data structure, i.e. Tidy data.
Data examples
Same data can be presented in various ways. The example data has 4 information:
- country
- year
- population
- Number of TB (tuberculosis) cases
Data (table1, table2, table3) is is available when you load tidyverse
Example 1
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
Example 2
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
Example 3
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
Tidy data
Which data structure is easier to work with, for general purpose?
Tidy data
A definition of tidy data:
- Each variable is a column;
- Each observation is a row;
- Each value is a cell; each cell is a single value.
Tidy data
When data is tidy, it is generally easier to work with the data.
- However, sometimes you’ll need to pivot (reshape) the data for your analysis.
Example
If you want to generate a rate of TB per 10,000 population on table1:
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
It is not simple to operate this with table2 setting.
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Quick exercise
Load package tidyverse.
table1 will be ready for access.
- What is the average population of each country? (Hint:
group_by()andsummarize()) - What the total sum of TB cases of all years by each country?
- What is the average percentage ratio of TB cases over population by each country?
Pivot data
For your analysis, you will need to pivot the data (aka reshape):
to
longerform (less variables and more observations)or to
wideform (more columns and less rows)
Whenever you pivot the data, think about columns that are affected, and the names and values.
Billboard example
Let’s take a look at billboard dataset which is in wide form.
Pivot longer
Each observation is a song, and we have 76 columns that describe rank.
To tidy up, we want “rank” variable in column that stores the number.
How can we pivot the data so that we have rank in one column?
Simple illustration on how pivot_longer() works:
Pivot Billboard data
| artist | track | date.entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk6 | 94 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk7 | 99 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk8 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk9 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk10 | NA |
Exercise
- Load
billboarddata with
- Select
artist,trackthe columns that are “wk1” to “wk12”
- Browse
?num_range - Use
num_range()inselect()
- Pivot the data to longer form.
Pivot data: wider
pivot_wider() works in opposite way in that:
increases the number of columns (variables)
decreases the number of rows (observations)
Wider forms are common for machine learning.
- One hot encoding / dummy variables
Example
Previous billboard_longer data:
# A tibble: 5 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
Pivot the data to wider form:
# A tibble: 5 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Note that we need at least two inputs for pivot_wider().
Financial Data Example
Stock price data commonly comes in the form below:
| symbol | date | open | high | low | close | volume | adjusted |
|---|---|---|---|---|---|---|---|
| TSLA | 2023-01-03 | 118.47 | 118.80 | 104.64 | 108.10 | 231402800 | 108.10 |
| TSLA | 2023-01-04 | 109.11 | 114.59 | 107.52 | 113.64 | 180389000 | 113.64 |
| TSLA | 2023-01-05 | 110.51 | 111.75 | 107.16 | 110.34 | 157986300 | 110.34 |
| TSLA | 2023-01-06 | 103.00 | 114.39 | 101.81 | 113.06 | 220911100 | 113.06 |
| TSLA | 2023-01-09 | 118.96 | 123.52 | 117.11 | 119.77 | 190284000 | 119.77 |
| TSLA | 2023-01-10 | 121.07 | 122.76 | 114.92 | 118.85 | 167642500 | 118.85 |
Question: What if you wanted to have a column for each stock’s adjusted price, like below?
| date | TSLA | BAC | XOM |
|---|---|---|---|
| 2023-01-03 | 108.10 | 30.97426 | 96.07031 |
| 2023-01-04 | 113.64 | 31.55660 | 96.34993 |
| 2023-01-05 | 110.34 | 31.49189 | 98.50565 |
| 2023-01-06 | 113.06 | 31.80616 | 99.69629 |
| 2023-01-09 | 119.77 | 31.32551 | 97.83820 |
| 2023-01-10 | 118.85 | 31.53811 | 99.29942 |
# A tibble: 250 × 4
date TSLA BAC XOM
<date> <dbl> <dbl> <dbl>
1 2023-01-03 108. 31.0 96.1
2 2023-01-04 114. 31.6 96.3
3 2023-01-05 110. 31.5 98.5
4 2023-01-06 113. 31.8 99.7
5 2023-01-09 120. 31.3 97.8
6 2023-01-10 119. 31.5 99.3
7 2023-01-11 123. 31.8 100.
8 2023-01-12 124. 31.9 102.
9 2023-01-13 122. 32.6 102.
10 2023-01-17 131. 31.9 102.
# ℹ 240 more rows
Class Exercise
- Examine data structure of
table1andtable2available ontidyverse.
How can you transform table1 to table2?
Answer
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Lab problem
Examine data population available from tidyverse.
Pivot the data to generate output as below.
(Hint: Browse document file ?pivot_wider() and see names_prefix argument.)
| country | year_1995 | year_1996 | year_1997 | year_1998 | year_1999 | year_2000 | year_2001 | year_2002 | year_2003 | year_2004 | year_2005 | year_2006 | year_2007 | year_2008 | year_2009 | year_2010 | year_2011 | year_2012 | year_2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 17586073 | 18415307 | 19021226 | 19496836 | 19987071 | 20595360 | 21347782 | 22202806 | 23116142 | 24018682 | 24860855 | 25631282 | 26349243 | 27032197 | 27708187 | 28397812 | 29105480 | 29824536 | 30551674 |
| Albania | 3357858 | 3341043 | 3331317 | 3325456 | 3317941 | 3304948 | 3286084 | 3263596 | 3239385 | 3216197 | 3196130 | 3179573 | 3166222 | 3156608 | 3151185 | 3150143 | 3153883 | 3162083 | 3173271 |
| Algeria | 29315463 | 29845208 | 30345466 | 30820435 | 31276295 | 31719449 | 32150198 | 32572977 | 33003442 | 33461345 | 33960903 | 34507214 | 35097043 | 35725377 | 36383302 | 37062820 | 37762962 | 38481705 | 39208194 |
| American Samoa | 52874 | 53926 | 54942 | 55899 | 56768 | 57522 | 58176 | 58729 | 59117 | 59262 | 59117 | 58652 | 57919 | 57053 | 56245 | 55636 | 55274 | 55128 | 55165 |
| Andorra | 63854 | 64274 | 64090 | 63799 | 64084 | 65399 | 68000 | 71639 | 75643 | 79060 | 81223 | 81877 | 81292 | 79969 | 78659 | 77907 | 77865 | 78360 | 79218 |
| Angola | 12104952 | 12451945 | 12791388 | 13137542 | 13510616 | 13924930 | 14385283 | 14886574 | 15421075 | 15976715 | 16544376 | 17122409 | 17712824 | 18314441 | 18926650 | 19549124 | 20180490 | 20820525 | 21471618 |
| Anguilla | 9807 | 10063 | 10305 | 10545 | 10797 | 11071 | 11371 | 11693 | 12023 | 12342 | 12637 | 12903 | 13145 | 13365 | 13571 | 13768 | 13956 | 14132 | 14300 |
| Antigua and Barbuda | 68349 | 70245 | 72232 | 74206 | 76041 | 77648 | 78972 | 80030 | 80904 | 81718 | 82565 | 83467 | 84397 | 85349 | 86300 | 87233 | 88152 | 89069 | 89985 |
| Argentina | 34833168 | 35264070 | 35690778 | 36109342 | 36514558 | 36903067 | 37273361 | 37627545 | 37970411 | 38308779 | 38647854 | 38988923 | 39331357 | 39676083 | 40023641 | 40374224 | 40728738 | 41086927 | 41446246 |
| Armenia | 3223173 | 3173425 | 3137652 | 3112958 | 3093820 | 3076098 | 3059960 | 3047002 | 3036032 | 3025652 | 3014917 | 3002911 | 2989882 | 2977488 | 2968154 | 2963496 | 2964120 | 2969081 | 2976566 |
Financial Data
Data in Finance
Financial data encompasses information related to financial markets, instruments, and economic indicators. It includes:
- Stock prices
- Bonds and interest rates
- Commodities prices
- Real estate prices
- Exchange rates
- Financial statements
Types of Financial Data
- Time Series Data: Sequential data points over time, e.g., daily stock prices.
- Cross-sectional Data: Data at a single point in time across many entities, e.g., balance sheets of various companies.
- Panel Data: Combines time series and cross-sectional data, e.g., yearly revenue of several companies over a decade.
Financial Data Sources 1
- Exchanges: NYSE, NASDAQ
- Government Publications: FRED, U.S. Treasury, Economic reports
- Financial News Outlets: Bloomberg, Reuters
- Data Providers: Yahoo Finance, Quandl (NASDAQ Data Link)
Financial Data Sources 2
- SEC EDGAR: U.S. corporate filings, public companies’ financials
- World Bank: Global economic indicators, international trade
- U.S. Census Bureau: Provides demographic information, and social indicators
- Macroeconomic Indicators:
- Bureau of Economic Analysis (BEA): For U.S. economic accounts, including GDP and personal income.
- Federal Reserve Economic Data (FRED): A comprehensive database of U.S. financial and economic data.
- International Monetary Fund (IMF): For global financial stability reports, world economic outlooks, and international financial statistics.
Recap on Portfolio theory
Risk and Return
The risk-return tradeoff is a fundamental concept in finance
- Higher Risk: Greater potential for return but higher chance of loss
- Lower Risk: More predictable outcomes, but typically lower returns
Calculating Returns
Return Calculation for Individual Assets:
The holding period return of an asset (\(r\)) over a period is calculated using the formula:
\[ r = \frac{P_{end} - P_{begin} + D}{P_{begin}} \]
\(r\) is the rate of return for the holding period
\(P_{end}\) is the ending price of the asset
\(P_{begin}\) is the beginning price of the asset
\(D\) represents any dividends or income received during the period
Calculating Returns
- We don’t need to take care of \(D\) when we calculate return based on adjusted close price
- Typically holding period is set with the same frequency, such as minute, hourly, daily, weekly, or monthly.
Portfolio Return
The return on a portfolio is a weighted sum of the individual returns of the assets within the portfolio.
- Portfolio Return Formula:
The return of a portfolio (\(r_p\)) is calculated as:
\[ r_p = \sum_{i=1}^{n} w_i r_i \]
\(r_p\) is the return of the portfolio over the period
\(w_i\) is the weight of asset \(i\) in the portfolio
\(r_i\) is the return of asset \(i\) over the period
\(n\) is the number of assets in the portfolio
Analytical Exercise
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual return of the portfolio.
Stand-Alone Risk
Stand-alone risk considers the risk of a single asset independently
- Measured by Standard Deviation of historic period returns
The formula for risk (\(\sigma\)) is:
\[ \sigma = \sqrt{\frac{\sum (r_t - \bar{r})^2}{T - 1}} = sd(r) \]
Where:
- \(r_t\) are the returns in period \(t\)
- \(\bar{r}\) is the average return
- \(T\) is the number of periods
Portfolio Risk
Portfolio risk involves the risk associated with holding a portfolio of assets.
- Not simply the weighted average of the stand-alone risks of portfolio assets
- Influenced by correlation between returns of assets
Analytical Exercise
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
The annual standard deviation of returns for asset A is 10% (0.1), and for asset B, it’s 15% (0.15). The correlation coefficient between the returns of assets A and B is 0.5.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual risk of the portfolio.
\[ σ_p^2=w_A^2σ_A^2+w_B^2σ_B^2+2w_Aw_Bσ_Aσ_BρXY \]
Portfolio Risk
Empirical approach
At each time \(t\), calculate the return of a portfolio (\(r_p\))
\[ r_{p,t} = \sum_{i=1}^{n} w_{i,t} r_{i,t} \]
Then calculate standard deviation of portfolio returns (\(\sigma_p\)) over \(t\)
\[ \sigma_p = sd(r_p) = \sqrt{\frac{\sum (r_{p,t} - \bar{r_p})^2}{T - 1}} \]
Walk-through Exercise
Step 1: Get stock prices
# A tibble: 6 × 3
symbol date adjusted
<chr> <date> <dbl>
1 TSLA 2020-01-02 28.7
2 TSLA 2020-01-03 29.5
3 TSLA 2020-01-06 30.1
4 TSLA 2020-01-07 31.3
5 TSLA 2020-01-08 32.8
6 TSLA 2020-01-09 32.1
Step 2: Period returns
# A tibble: 6 × 3
# Groups: symbol [1]
symbol date daily_return
<chr> <date> <dbl>
1 BAC 2020-01-03 -0.0208
2 BAC 2020-01-06 -0.00143
3 BAC 2020-01-07 -0.00660
4 BAC 2020-01-08 0.0101
5 BAC 2020-01-09 0.00172
6 BAC 2020-01-10 -0.00828
Step 3: Pivot to Wide-form
We will use pivot technique, to deviate from tidy form.
# A tibble: 6 × 4
date BAC TSLA XOM
<date> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804
2 2020-01-06 -0.00143 0.0193 0.00768
3 2020-01-07 -0.00660 0.0388 -0.00818
4 2020-01-08 0.0101 0.0492 -0.0151
5 2020-01-09 0.00172 -0.0219 0.00766
6 2020-01-10 -0.00828 -0.00663 -0.00889
Drop row with missing values
Step 4: Generate portfolio return
# A tibble: 6 × 5
date BAC TSLA XOM port_ret
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804 0.000718
2 2020-01-06 -0.00143 0.0193 0.00768 0.00933
3 2020-01-07 -0.00660 0.0388 -0.00818 0.00623
4 2020-01-08 0.0101 0.0492 -0.0151 0.00924
5 2020-01-09 0.00172 -0.0219 0.00766 -0.00241
6 2020-01-10 -0.00828 -0.00663 -0.00889 -0.00809
Step 5: Calculate Portfolio risk (sd)
Recap on Capital Asset Pricing Model
CAPM
Derived from portfolio theory and optimization.
\[ E(r_i)=r_f+β_i(E(r_m)−r_f)\]
Where:
\(E(r_i)\) is the expected return of the investment
\(r_f\) is the risk-free rate
\(\beta_i\) is the beta of the investment
\(E(r_m)\) is the expected return of the market
\((E(r_m) - r_f)\) is known as the market risk premium
Beta
Beta is a measure of the sensitivity of an individual investment’s returns to the market.
- An indication of an asset’s risk relative to the systematic risk
\[ \beta_i = \frac{Cov(r_i - r_f, r_m-r_f)}{Var(r_m-r_f)} \]
- Same as coefficient from below simple linear regression \(\beta_1\)
\[ r_i - r_f = \beta_0 + \beta_1(r_m - r_f)+ e_i \]
In some specific cases (e.g., intraday frequency) beta is estimated with raw return (\(r_i\) and \(r_m\)), not excess returns (\(r_i - r_f\))
Alpha
- Alpha (\(\alpha\)) is a financial metric indicating the extra return compared to the return of a benchmark index
- In CAPM, \(\alpha\) is the intercept of the regression
- A positive \(\alpha\) indicates outperformance, while a negative indicates underperformance
Walk-through Exercise
Prepare a stock return, market return (index).
Then combine two dataframe (bind them in row-wise).
tickers <- c("MSFT")
stock_prices <- tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
snp500 <- tq_get("SP500", get = "economic.data", from = "2020-01-01", to = "2023-12-31")
snp500 <- snp500 |> rename(adjusted = price) # rename column for binding
prices <- bind_rows(stock_prices, snp500)
prices |> glimpse()Rows: 2,049
Columns: 8
$ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
$ date <date> 2020-01-02, 2020-01-03, 2020-01-06, 2020-01-07, 2020-01-08, …
$ open <dbl> 158.78, 158.32, 157.08, 159.32, 158.93, 161.84, 162.82, 161.7…
$ high <dbl> 160.73, 159.95, 159.10, 159.67, 160.80, 162.22, 163.22, 163.3…
$ low <dbl> 158.33, 158.06, 156.51, 157.32, 157.95, 161.03, 161.18, 161.2…
$ close <dbl> 160.62, 158.62, 159.03, 157.58, 160.09, 162.09, 161.34, 163.2…
$ volume <dbl> 22622100, 21116200, 20813700, 21634100, 27746500, 21385000, 2…
$ adjusted <dbl> 152.5057, 150.6067, 150.9960, 149.6192, 152.0025, 153.9014, 1…
Pivot price data into wide form:
# A tibble: 6 × 3
date MSFT SP500
<date> <dbl> <dbl>
1 2020-01-02 153. 3258.
2 2020-01-03 151. 3235.
3 2020-01-06 151. 3246.
4 2020-01-07 150. 3237.
5 2020-01-08 152. 3253.
6 2020-01-09 154. 3275.
Generate stock and index returns
Prepare risk-free rate (use 3 month treasury bill rate)
The risk-free rate is in percentage term (%) and annualized.
Convert the rate into daily level
- Use 252 business days assumption
- Use simple division for daily conversion \(r_d = r_y/252\)
# A tibble: 6 × 4
symbol date price rf
<chr> <date> <dbl> <dbl>
1 DGS3MO 2020-01-01 NA NA
2 DGS3MO 2020-01-02 1.54 0.0000611
3 DGS3MO 2020-01-03 1.52 0.0000603
4 DGS3MO 2020-01-06 1.56 0.0000619
5 DGS3MO 2020-01-07 1.54 0.0000611
6 DGS3MO 2020-01-08 1.54 0.0000611
Convert to wide form:
# A tibble: 6 × 2
date DGS3MO
<date> <dbl>
1 2020-01-01 NA
2 2020-01-02 0.0000611
3 2020-01-03 0.0000603
4 2020-01-06 0.0000619
5 2020-01-07 0.0000611
6 2020-01-08 0.0000611
Join two dataframe column-wise
# A tibble: 3 × 4
date MSFT SP500 DGS3MO
<date> <dbl> <dbl> <dbl>
1 2020-01-02 NA NA 0.0000611
2 2020-01-03 -0.0125 -0.00706 0.0000603
3 2020-01-06 0.00258 0.00353 0.0000619
Then generate excess returns:
CAPM Estimation
Use lm() for linear regression fit.
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Coefficients:
(Intercept) Mkt_exret
0.0005149 1.1718518
To browse summary of regression:
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Residuals:
Min 1Q Median 3Q Max
-0.068986 -0.006335 -0.000456 0.006333 0.076456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005149 0.0003655 1.409 0.159
Mkt_exret 1.1718518 0.0251585 46.579 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01154 on 996 degrees of freedom
(45 observations deleted due to missingness)
Multiple R-squared: 0.6854, Adjusted R-squared: 0.6851
F-statistic: 2170 on 1 and 996 DF, p-value: < 2.2e-16
How to access and extract coefficient estimates?
(Intercept) Mkt_exret
0.0005148837 1.1718518320
[1] "double"
To extract beta estimate:
Discussions
Timeframe
CAPM estimate (or any other) is largely dependent on the timeframe of your choice.
Consider following:
- Estimate based on recent 1 month
- Estimate based on recent 3 month
- Estimate based on recent 12 month
Should they be similar? Usually not.
Exercise
Using our previous example, estimate the CAPM beta with the following:
- Assume today is 2023-12-31.
- Estimate based on recent 1 month data
- Estimate based on recent 3 month data
- Estimate based on recent 12 month data
Rolling Windows
Given a set “lookback” period, estimate should only use previous information.
To capture timely information at each period, rolling regressions are often performed.
Lab Problems
Problem 1
Generate stock prices with below.
Using adjusted daily closing prices, calculate the expected (average) return and risk of
a equal-weighted portfolio
30% on Apple and 70% on Microsoft
70% on Apple and 30% on Microsoft
Problem 2
Based on the portfolio return (70% on Apple and 30% on Microsoft), calculate the portfolio beta.
- Use daily periodicity.
- Use 3 month treasury rate for risk-free rate.
Data I/O
Read and Write
Importing data is crucial initial process for any data science project.
We will learn how to read external data to
R, indata.frameobject.Also how to write
data.frameinRto a local file.
CSV files
CSV: Comma-Seperated Values
A plain, human-readable
textdata fileMinimalistic, widely used
Typically opened with Excel, but it is not an excel file!
Since it is
text, R tries to “guess” the type of each column when importing
CSV Example
A csv representation of iris dataframe:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
When saved to csv file:
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
CSV files
Write and read csv
Many packages support writing/reading csv files;
- base R (
utilspackage): basic but slow readrfrom tidyverse: fast, functionalvroom,data.table: extremely fast and functional
We use readr package and will discuss I/O speed later.
CSV files
To write a data.frame to a csv file: write_csv()
To read a .csv file to a data.frame: read_csv()
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can specify a downloadable url instad:
Rows: 398 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): origin, name
dbl (7): mpg, cylinders, displacement, horsepower, weight, acceleration, mod...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 9
mpg cylinders displacement horsepower weight acceleration model_year origin
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 18 8 307 130 3504 12 70 usa
2 15 8 350 165 3693 11.5 70 usa
3 18 8 318 150 3436 11 70 usa
4 16 8 304 150 3433 12 70 usa
5 17 8 302 140 3449 10.5 70 usa
6 15 8 429 198 4341 10 70 usa
# ℹ 1 more variable: name <chr>
xlsx files
To read a .xlsx file to a data.frame: read_excel() from readxl package
# A tibble: 6 × 11
`in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Starting Line 19658 31398 42206 47681 33448 33615
2 Depreciation & Amortiza… 11917 14760 15583 15888 17182 17297
3 Non-Cash Items -1474 129 262 3313 482 3110
4 Change in Working Capit… -1663 2126 -2706 -10712 -6198 -8906
5 Cash from Operating Act… 28438 48413 55345 56170 44914 45116
6 Change in Fixed Assets … -22491 -26871 -30975 -34271 -33669 -32952
# ℹ 4 more variables: `FY '15` <dbl>, `FY '16` <dbl>, `FY '17` <dbl>,
# `FY '18` <dbl>
Other data formats
There are other common data formats:
- ‘.dat’, ‘.sas7bdat’, ‘.dta’
- ‘.rds’: R native
- ‘.json’: flexible data structure (NoSQL)
- ‘.parquet’,‘.feather’
Data cleaning
janitor package offers simple variable name cleaner: clean_names().
# A tibble: 6 × 11
in_million_usd fy_09 fy_10 fy_11 fy_12 fy_13 fy_14 fy_15 fy_16 fy_17
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Sta… 19658 31398 42206 47681 33448 33615 16551 8375 19848
2 Depreciation &… 11917 14760 15583 15888 17182 17297 18048 22308 19893
3 Non-Cash Items -1474 129 262 3313 482 3110 -750 -5313 -7921
4 Change in Work… -1663 2126 -2706 -10712 -6198 -8906 -3505 -3288 -1754
5 Cash from Oper… 28438 48413 55345 56170 44914 45116 30344 22082 30066
6 Change in Fixe… -22491 -26871 -30975 -34271 -33669 -32952 -26490 -16163 -15402
# ℹ 1 more variable: fy_18 <dbl>
Exercise 1 : Read
- Install and load new packages:
janitorandreadxl - Download the exxon mobile data to your directory.
- Read the statement file with
read_excel()and store asexxon_statement - Clean the variable names with
clean_names().
Exercise 2: Join
- Read
exxon_categories.csvfile withread_csv()asexxon_categories. - Left join
exxon_categoriestoexxon_statementas:
Exercise 3 : Cleaning and Tidying
- Rename variable
in_million_usdtoAccount. - Pivot to a long form and store as
exxon_longwith below arguments:
- use
starts_with() - names_to = “Year”
- names_prefix = “fy_”
- values_to = “M_USD”
Grammar of Graphics (ggplot)
Case 1: Single Categorical variable
Categorical variables (factors) take a predefined set of values.
- Ordered: size (Large, medium,…) grades (A, A-, …)
- Unordered: country, ethnicity
To visualize distribution of categorical variable, bar plots are often used.
geom_bar() and geom_col()
geom_bar() : When you need count of single category
geom_col() : When you need different Y
Rows: 170 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Account, Category, Year
dbl (1): M_USD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Effective visualization
The end goal of the bar plot:
Adding color to Bar plots
To fill the color, use “fill”. For border color, use “color”.
Scaling Y axis
To change scientific notation to currency format: use scales package.
More ticks
To add more breaks (ticks) on Y axis, control n.breaks in continuous scale.
Axis Labels
To add axis labels and plot title, use labs() function.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
)Adding Themes
By default, ggplot uses gray background. To change, you can use other built-in themes.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()Reordering by value
To reorder factor variable according to its value: fct_reorder() on the variable.
total_amt_bycat |>
ggplot(aes(x = fct_reorder(Category, -Total_USD) , y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()Case 2: Single Numeric variable
Numeric variable has infinite range of values.
- In finance, risk and returns, ratios (ROA)
Commonly used visualization is histogram
- Proper bin size is important for effective visualization.
stock_returns <- tq_get(c("AAPL", "KO"), from = "2020-01-01", to = "2023-12-31") |>
group_by(symbol) |>
arrange(symbol, date) |>
mutate(pct_ret = (adjusted / lag(adjusted) - 1) * 100) |>
ungroup() |>
select(symbol, date, pct_ret)
stock_returns |>
filter(symbol == "AAPL") |>
ggplot(aes(x = pct_ret)) +
geom_histogram() # default bins: 30`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
An alternative for numeric distribution is density plot.
- Uses kernel smoothing
- Bandwith (bw) choice is important
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Case 3: Numeric + Categorical
Visualizing two (or more) variables can be useful to show variable relationships.
- Boxplot, Violin, Density plots
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Case 4: Two categorical
Barplots can be used to visualize two categorical variables.
- Use “fill” aesthetic to add one more dimension
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_col()`).
Case 5: Two numerical
Scatterplots and regressions can effectively visualize relationships.
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Case 6: Time series
Time series plots have date variable on X axis.
Line plots are often used.
Time series: Candlestick
Candlestick price charts are often used as well:
# Plot candlestick chart
stock_price <- tq_get("AAPL", from = "2023-12-01", to = "2023-12-31")
stock_price |>
ggplot(aes(x = date, open = open, high = high, low = low, close = close)) +
geom_candlestick() +
labs(
title = "AAPL Candlestick Chart, December 2023",
x = "Date",
y = "Price (USD)") +
theme_bw() stock_price |>
ggplot(aes(x = as.factor(date), open = open, high = high, low = low, close = close)) +
geom_candlestick() +
labs(
title = "AAPL Candlestick Chart, December 2023",
x = "Date",
y = "Price (USD)") +
theme_bw() +
scale_x_discrete(
breaks = as.factor(stock_price$date[seq(1, length(stock_price$date), 3)])
)